TELESCOPES DON'T MAKE CATALOGUES!

David W. Hogg 1 and Dustin Lang 2 



Abstract. Astronomical instruments make intensity measurements; any 
precise astronomical experiment ought to involve modeling those mea- 
surements. People make catalogues, but because a catalogue requires 
hard decisions about calibration and detection, no catalogue can con- 
tain all of the information in the raw pixels relevant to most scientific 
investigations. Here we advocate making catalogue-like data outputs 
that permit investigators to test hypotheses with almost the power of 
the original image pixels. The key is to provide users with approxi- 
mations to likelihood tests against the raw image pixels. We advocate 
three options, in order of increasing difficulty: The first is to define cat- 
alogue entries and associated uncertainties such that the catalogue con- 
tains the parameters of an approximate description of the image-level 
likelihood function. The second is to produce a K-catalogue sampling
in "catalogue space" that samples a posterior probability distribution 
of catalogues given the data. The third is to expose a web service or 
equivalent that can re-compute on demand the full image-level likeli- 
hood for any user-supplied catalogue. 

In probabilistic inference, the goal is to transform information gathered by the 
data-taking device into information about the model or parameters of interest, with 
as little loss as possible. The most precise methods for inference involve "forward 
modeling" of the measurements: A model that can generate the data accurately 
is a good model, and is constrained by every data element that it (interestingly) 
generates. For this reason, everyone who wants to perform a precise experiment
with a telescope wants to model the image pixels. Recently, this has been realised
in a range of astrophysics domains, from weak lensing (Bernstein & Jarvis 2002)
to astrometry (Anderson et al. 2008; Lang et al. 2009).

Unfortunately, since its earliest days, astronomy has been done through cata- 
logues. Historical data sets were made by eye, but enormous catalogues such as 
USNO-B (Monet et al. 2003), 2MASS (Skrutskie et al. 2006), and SDSS
(Abazajian et al. 2009) are based on digital intensity measurements. Despite this,
for practical purposes, the primary data products of those surveys are catalogues;
the survey teams considered it unimportant to give easy world-wide access to all
the raw imaging pixels.

If an investigator wants to know the J-band fluxes of a few million SDSS sources 
using 2MASS (something we often want to know) , the only easy option is catalogue 
matching in which the investigator makes up heuristic matching conditions, tests 
them on a subset of sources, and then runs them on the full catalogues. For 
sources that don't get matched, what has the investigator learned? Very little, 
because it is almost impossible without access to the imaging pixels to determine 
the sensitivity of the 2MASS data to the source at that source position, and even if 
the sensitivity is nominal, a two-sigma measurement (or even zero-sigma) is much 
more constraining than the uninformative statement that the source did not satisfy 
the 2MASS catalogue-inclusion criteria. Furthermore, even when sources do match 
there are fundamental issues about what can be inferred (Budavári & Szalay 2008;
Hogg & Lang 2008).

Most astronomical projects can be phrased as hypothesis tests. Thinking for- 
ward to Gaia, one important set of projects will involve finding streams of stars — 
linear features in phase space — that are likely remnants of disrupted stellar clus- 
ters or galaxies (e.g., Helmi this volume). The detection of these lines of stars 
can be cast as a hypothesis test: If we put these stars onto a family of simi- 
lar orbits, does our explanation of the data improve? Another set of projects 
will involve refinement of Milky Way parameters; these involve optimisation of 
some kind of likelihood, possibly marginalising out the details of the distribution 
functions (e.g., Bovy et al. 2010a). Another set will involve testing the physical
properties of the velocity-space structure in the Milky Way disk (Dehnen 1998;
De Simone et al. 2004; Bovy & Hogg 2010b); these make different predictions for
the Gaia data. The question of which is best is — in part — a question about their 
relative likelihoods under the data. 

This article is about astronomical imaging in general and the data products 
derived therefrom, but the specific examples will be Gaia-related. 



1 A definition of uncertainty 

The simplest option for transmitting image information to the catalogue is to build 
the catalogue such that hypothesis tests against it are identical to — or as close as 
possible to — hypothesis tests against the raw imaging. Not only is this possible, 
it is practical in some cases. 

For example, the current plan for the spectroscopic output of SDSS-III BOSS is
that each individual BOSS spectrum will be a list of wavelengths $\lambda_i$ and at each of
those wavelengths a flux value $f_i$ and associated inverse uncertainty variance $1/\sigma_i^2$.
These fluxes and inverse variances will not be simply "best-fit" values; they will
be the parameters of a model of the likelihood function (Bolton & Schlegel 2010):
Imagine that for this spectrum an investigator has two possible models $m_1(\lambda)$ and
$m_2(\lambda)$. The fluxes $f_i$ and inverse variances $1/\sigma_i^2$ are constructed such that the
natural $\chi^2$ difference

$$\Delta\chi^2 = \sum_i \frac{[m_2(\lambda_i) - f_i]^2}{\sigma_i^2} - \sum_i \frac{[m_1(\lambda_i) - f_i]^2}{\sigma_i^2} \tag{1.1}$$

is as close as possible to the $\chi^2$ difference the investigator would have obtained
had he or she performed the $\chi^2$ test in the raw spectrograph image pixels. Since
$\chi^2$ is related to likelihood by

$$\Delta\chi^2 = -2 \ln\frac{\mathcal{L}_2}{\mathcal{L}_1} , \tag{1.2}$$

this makes the $f_i$ and $1/\sigma_i^2$ the parameters of an approximation to the
pixel-level likelihood function. It is an approximation because it assumes Gaussian
uncertainty, and small wavelength-to-wavelength covariance (this latter point is
addressed in detail by Bolton & Schlegel 2010).
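
As a minimal sketch of how a catalogue user might apply Equations (1.1) and (1.2), the following Python compares two models against one catalogue spectrum. The function name `delta_chi2`, the array names, and all numerical values are illustrative assumptions; they are not BOSS data or part of any BOSS interface.

```python
import numpy as np

def delta_chi2(flux, ivar, model1, model2):
    """Chi-squared difference of Eq. (1.1): chi2(model2) - chi2(model1)."""
    flux = np.asarray(flux, dtype=float)
    ivar = np.asarray(ivar, dtype=float)
    chi2_1 = np.sum(ivar * (np.asarray(model1, dtype=float) - flux) ** 2)
    chi2_2 = np.sum(ivar * (np.asarray(model2, dtype=float) - flux) ** 2)
    return chi2_2 - chi2_1

# Toy catalogue spectrum (made-up numbers, for illustration only).
wavelength = np.array([4000.0, 5000.0, 6000.0])  # lambda_i
flux = np.array([1.0, 2.0, 3.0])                 # f_i
ivar = np.array([4.0, 4.0, 1.0])                 # 1 / sigma_i^2

m1 = 1.0 + (wavelength - 4000.0) / 1000.0        # model 1: linear rise
m2 = np.full_like(wavelength, 2.0)               # model 2: flat spectrum

dchi2 = delta_chi2(flux, ivar, m1, m2)
log_like_ratio = -0.5 * dchi2                    # ln(L2 / L1), from Eq. (1.2)
```

A positive `dchi2` favours model 1. The point of the catalogue definition is that this cheap calculation stands in for the same test performed in the raw pixels.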

This suggests a general definition for catalogue elements and their associated 
uncertainties, which we explicitly advocate here. In the imagined case of Gaia, 
the definition would be this: For each star on the sky, there will be six (or more) 
parameters, say 

$$y^T = [\mathrm{RA}, \mathrm{Dec}, \varpi, \mu_\alpha, \mu_\delta, v_r] \tag{1.3}$$

and an associated six-by-six inverse covariance matrix $C^{-1}$. Imagine that for this
star an investigator has two proposals $Y_1$ and $Y_2$ about the six parameters. We
recommend that the catalogue entries $y$ and $C^{-1}$ be defined such that the natural
$\chi^2$ difference

$$\Delta\chi^2 = [Y_2 - y]^T \cdot C^{-1} \cdot [Y_2 - y] - [Y_1 - y]^T \cdot C^{-1} \cdot [Y_1 - y] \tag{1.4}$$

is as close as possible to the logarithmic likelihood ratio $-2 \ln[\mathcal{L}_2/\mathcal{L}_1]$, now
marginalised over all calibration and instrument parameters, that you would have
measured if you had access to the raw image pixels and racks of metal. The
proposal is to adopt this definition; adopting it is essentially equivalent to
obtaining the catalogue entries and uncertainties from Gaussian fits to the
marginalised likelihood.
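
The six-parameter version in Equation (1.4) can be sketched the same way, now with a full inverse covariance matrix. The entry values, the uncertainties, and the diagonal form of the matrix below are hypothetical choices made only to keep the example short; a real entry would generally carry off-diagonal terms.

```python
import numpy as np

def delta_chi2_entry(y, cinv, Y1, Y2):
    """Eq. (1.4): chi-squared difference between proposals Y2 and Y1."""
    y = np.asarray(y, dtype=float)
    cinv = np.asarray(cinv, dtype=float)
    r1 = np.asarray(Y1, dtype=float) - y
    r2 = np.asarray(Y2, dtype=float) - y
    return r2 @ cinv @ r2 - r1 @ cinv @ r1

# Hypothetical entry: [RA, Dec, parallax, pm_alpha, pm_delta, v_r]; units arbitrary.
y = np.array([10.0, -5.0, 2.0, 1.0, -1.0, 20.0])
sigma = np.array([0.1, 0.1, 0.5, 0.2, 0.2, 2.0])
cinv = np.diag(1.0 / sigma**2)   # diagonal only for simplicity (an assumption)

Y1 = y.copy()                    # proposal 1: the catalogue values themselves
Y2 = y.copy()
Y2[2] += 1.0                     # proposal 2: parallax shifted by two sigma

dchi2 = delta_chi2_entry(y, cinv, Y1, Y2)
```

Here the two-sigma parallax shift costs four units of $\chi^2$, as expected for a Gaussian approximation.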

The nuisance-parameter marginalization takes this beyond the BOSS plan and 
makes it challenging. Marginalisation of a likelihood requires — in the mathemati- 
cal sense — a prior probability distribution function (PDF) over the parameters to 
be marginalised out and it looks something like this: 



$$\mathcal{L}(Y) = p(D \mid Y) = \int d\theta \, p(\theta) \, p(D \mid Y, \theta) , \tag{1.5}$$

where the object $D$ is the entire raw data set (pixel intensities), $Y$ is the set of
six parameters of the individual object, and $\theta$ is the set of all calibration
parameters in the model, including all those related to attitude and hardware. The
unmarginalised likelihood (the probability of the data given the star and calibration
parameters) depends, of course, on both the star and the calibration parameters;
the marginalisation is the only conservative and accurate way to propagate
uncertainties about calibration into the astrophysical parameters of interest. The prior
$p(\theta)$ is required as a measure on the parameter space that permits integration.
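
To make the integral in Equation (1.5) concrete, here is a toy Monte Carlo sketch: a single datum $D = Y + \theta + \mathrm{noise}$, with one calibration offset $\theta$ drawn from a Gaussian prior. The one-dimensional model, the Gaussian forms, and every name and number are assumptions made for illustration; the real problem has enormous numbers of pixels and nuisance parameters.

```python
import numpy as np

def log_marginal_likelihood(D, Y, sigma, theta_samples):
    """Monte Carlo estimate of ln p(D|Y), averaging p(D|Y,theta) over prior draws."""
    # ln p(D | Y, theta) for every prior draw of the calibration offset theta
    resid = D - (Y + theta_samples)
    log_like = -0.5 * (resid / sigma) ** 2 - 0.5 * np.log(2.0 * np.pi * sigma**2)
    # log of the mean of exp(log_like), computed stably by factoring out the peak
    peak = log_like.max()
    return peak + np.log(np.mean(np.exp(log_like - peak)))

rng = np.random.default_rng(0)
sigma, prior_width = 1.0, 1.0
theta_samples = rng.normal(0.0, prior_width, size=100_000)  # draws from p(theta)

D, Y = 0.5, 0.0
log_L_mc = log_marginal_likelihood(D, Y, sigma, theta_samples)

# In this toy case the integral is analytic: a Gaussian with summed variances.
var = sigma**2 + prior_width**2
log_L_exact = -0.5 * (D - Y) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var)
```

The marginalised likelihood is broader than the unmarginalised one (variance `sigma**2 + prior_width**2` rather than `sigma**2`), which is how calibration uncertainty propagates into the constraint on $Y$.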

If the entries are defined this way, a catalogue user can make approximations 
to likelihood comparisons at the pixel level. Unfortunately, this proposal is only 
sensible when the marginalized likelihoods are close to Gaussian in form and when 
star-star covariances can be ignored. These covariances are not expected to be 
negligible for Gaia (see Holl this volume). However, this will be a problem for 
almost all of the currently conceived plans for the Gaia catalogue. Transmission 
of star-star covariances is permitted by our next proposal. 

2 A sampling in catalogue space 

At the present day, in most situations of fitting large (non-parametric, or highly 
parameterised) models to large data sets, investigators report uncertainty infor- 
mation through samplings. A sampling has the disadvantage that it is a mixture- 
of-delta-functions approximation, but it has the great advantage that it can, in 
principle, describe an arbitrarily complicated PDF. There are also excellent
sampling technologies available (e.g., MacKay 2003).

A useful sampling of Gaia catalogues could have the following properties: There 
would be K catalogues that represent a sampling — perhaps not a Poisson sam- 
pling, but a sampling — from the posterior PDF in what you might call "catalogue 
space" . Any experiment or measurement is performed on all K samples. The 
variance across the K outcomes of the K identical experiments is an estimate of 
the uncertainty on the results of the outcome on the primary catalogue. In this 
vision, the sampling provides a rank-K approximation to the full-catalogue
covariance matrix (which is billions by billions and non-sparse). 

Importantly, in this vision, the sampling of catalogues should be a sampling 
not just in the space of the astrophysical parameters but also in the space of the 
attitude and calibration parameters (the nuisance parameters) . This ensures that 
the variance over samples contains the propagated uncertainty from the nuisance 
parameters, and it doesn't add significantly to any of the technical challenges of 
producing the sampling. 
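
A rough sketch of how a user would work with such a sampling follows. The K catalogues here are simulated stand-ins, with a shared per-sample offset playing the role of a calibration (nuisance) parameter; the function `experiment` and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_stars = 16, 1000

# Stand-in for K published catalogue samples of n_stars parallaxes. The offset
# shared by all stars within a sample mimics a sampled calibration parameter,
# which induces star-star covariance.
true_parallax = rng.uniform(1.0, 5.0, size=n_stars)
shared_offset = rng.normal(0.0, 0.05, size=(K, 1))
noise = rng.normal(0.0, 0.3, size=(K, n_stars))
catalogues = true_parallax + shared_offset + noise   # shape (K, n_stars)

def experiment(parallax):
    """Any user measurement; here, simply the mean parallax."""
    return parallax.mean()

# Run the identical experiment on every catalogue sample.
outcomes = np.array([experiment(cat) for cat in catalogues])
uncertainty = outcomes.std(ddof=1)

# What independent single-star uncertainties alone would have implied.
naive = 0.3 / np.sqrt(n_stars)
```

In this toy setup `uncertainty` comes out well above `naive`, because the scatter across samples carries the shared calibration term that single-star uncertainties cannot express.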

Although there is an enormous literature on sampling methods for large prob- 
lems, there is a useful hack that might be appropriate if practical sampling meth- 
ods fail on a problem of this scale: Approximate samples can be made by inflating 
leave-one-out jackknife trials. The idea would be to cut the raw data stream into 
K equally informative (roughly speaking, equal-sized) disjoint subsamples, and 
construct the K complementary leave-one-out subsamples, in each of which one of 
the K disjoint subsamples is left out. All data analysis (from raw data to inference 
of all nuisance and astrophysical parameters through to catalogue generation) is 
performed on each of the K leave-one-out subsamples. The difference (in catalogue 
space) between the primary catalogue and each of the K leave-one-out catalogues 
can be amplified by a factor of (very close to) $\sqrt{K}$ to produce a sampling that has
the same full-catalogue variance and covariance as the classical jackknife estimate.
This might be an expedient way to produce and publish the jackknife estimate of 
the full error covariance, that has most of the good properties of a sampling. 
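
The inflated-jackknife idea can be sketched as follows. The function `make_catalogue` is a hypothetical stand-in for the whole pipeline from raw data to catalogue. The inflation factor used here, $\sqrt{K-1}$, is the value for which the covariance of the inflated samples (normalised by $1/K$, an assumption of this sketch) equals the classical jackknife covariance exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 10
data = rng.normal(0.0, 1.0, size=(K, 50))   # K disjoint, equal-sized subsamples

def make_catalogue(chunks):
    """Stand-in for the full pipeline: two 'catalogue entries' from the data."""
    x = np.concatenate(chunks)
    return np.array([x.mean(), np.median(x)])

primary = make_catalogue(list(data))
leave_one_out = np.array([
    make_catalogue([data[j] for j in range(K) if j != i]) for i in range(K)
])

# Classical jackknife covariance of the catalogue entries.
dev = leave_one_out - leave_one_out.mean(axis=0)
jack_cov = (K - 1) / K * dev.T @ dev

# Inflate each leave-one-out difference from the primary catalogue.
factor = np.sqrt(K - 1)
samples = primary + factor * (leave_one_out - primary)

# Covariance across the K inflated samples, with 1/K normalisation.
sample_cov = np.cov(samples, rowvar=False, bias=True)
```

With these choices `sample_cov` reproduces `jack_cov`, so the K inflated catalogues can be published and used like any other sampling.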

Technically, a sampling is a description not of the likelihood but of a posterior 
PDF, and it has the disadvantage to its users that it has had a prior PDF mul- 
tiplied in. This is a fundamental limitation of the approach, but for most kinds 
of catalogue outputs astronomers are interested in, the data are so informative 
that the posterior PDF in catalogue space looks very much like the likelihood 
function unless the adopted priors are extremely informative. Another issue is 
that additional information (say, from another telescope or instrument), which is 
expressed as a multiplicative likelihood term, can cause huge changes in the rel- 
ative posterior probabilities of the K catalogue samples. Indeed, all K samples 
may be essentially ruled out by new information, or the samples may fail to cap- 
ture an important mode in catalogue space. This is a fundamental limitation of 
sample-based representations. 
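
The effect of new information on a small sampling can be illustrated with importance weights. In this sketch, which uses made-up numbers and a hypothetical helper `reweight`, four equally weighted samples of one star's parallax are confronted with a precise external measurement that disagrees with all of them.

```python
import numpy as np

def reweight(log_like_new):
    """Weights for K equally weighted samples after multiplying in a new
    likelihood term, plus the effective sample size 1 / sum(w^2)."""
    log_w = np.asarray(log_like_new, dtype=float)
    w = np.exp(log_w - log_w.max())   # subtract the peak to avoid underflow
    w /= w.sum()
    ess = 1.0 / np.sum(w**2)
    return w, ess

# K = 4 samples of one parallax, and a new Gaussian measurement (all invented).
samples = np.array([1.8, 2.0, 2.2, 2.4])
new_value, new_sigma = 3.5, 0.1
log_like_new = -0.5 * ((samples - new_value) / new_sigma) ** 2

weights, ess = reweight(log_like_new)
```

Essentially all of the weight lands on the single sample nearest the new measurement, so the effective sample size collapses to about one even though that sample is itself a poor description of the new data.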

In an ideal but challenging world, a sampling in catalogue space would explore 
regions of differing complexity. Not sure if a star is binary? Some of the K cata- 
logues would have it binary and some would have it single. There is no reason in 
principle that the catalogues would be qualitatively similar in cases in which there 
is real statistical scientific uncertainty. This substantially complicates sampling
strategies and downstream analyses by users, so we leave it here only
as a comment. Changes to catalogue complexity are handled very naturally by
our third proposal. 



3 Exposing the full likelihood function 

All this has been about approximations to the likelihood function, which raises
the question: Why not just publish the likelihood function itself? Here what
we imagine is an interface (perhaps in the cloud; see O'Mullane this volume) to 
which a user could submit a catalogue diff — a difference between the primary Gaia 
catalogue and the catalogue he or she wants to test. The machinery behind the 
interface would use the modified catalogue to generate the raw pixels, compare to 
the data, and return the likelihood diff. Of course the user must be permitted to 
modify also nuisance parameters, or else be presented with the option of obtaining 
the results marginalised over these. 
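
A toy version of such an interface might look like the sketch below. Everything in it is hypothetical: the one-dimensional image, the Gaussian point-spread function, the fixed calibration constants, and the names `render`, `log_likelihood`, and `likelihood_diff`. It is meant only to show the shape of the computation (catalogue diff in, likelihood diff out), not any planned Gaia service.

```python
import numpy as np

PIXELS = np.arange(32.0)
PSF_WIDTH, NOISE_SIGMA = 1.5, 1.0   # calibration assumed known in this toy

def render(catalogue):
    """Forward model: predicted pixel intensities for {name: (position, flux)}."""
    image = np.zeros_like(PIXELS)
    for position, flux in catalogue.values():
        profile = np.exp(-0.5 * ((PIXELS - position) / PSF_WIDTH) ** 2)
        image += flux * profile / profile.sum()
    return image

def log_likelihood(catalogue, data):
    """Gaussian pixel-noise log-likelihood, up to an additive constant."""
    return -0.5 * np.sum(((data - render(catalogue)) / NOISE_SIGMA) ** 2)

def likelihood_diff(primary, diff, data):
    """ln L(modified) - ln L(primary). Entries in `diff` add or replace
    stars; a value of None deletes the star."""
    modified = {**primary, **diff}
    modified = {k: v for k, v in modified.items() if v is not None}
    return log_likelihood(modified, data) - log_likelihood(primary, data)

# Simulated "raw pixels" generated from a two-star primary catalogue.
primary = {"star_a": (10.0, 200.0), "star_b": (20.0, 100.0)}
rng = np.random.default_rng(3)
data = render(primary) + rng.normal(0.0, NOISE_SIGMA, size=PIXELS.size)

# User hypothesis: star_b does not exist.
dlnl = likelihood_diff(primary, {"star_b": None}, data)
```

Because the diff can add or delete entries, the same call handles changes in catalogue complexity, such as testing whether a source is single or binary.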

An interface of this kind would be expensive to build, run, and maintain; it is 
probably impractical at the present day. However, it would permit literally arbi- 
trary hypothesis testing, and arbitrary calculation of covariances. It would also 
make it the responsibility of users to perform their own uncertainty analysis and 
propagation, relieving the teams of some of their most burdensome duties. Ex- 
tremely clever systems could cache computation so subsequent users could benefit 
from the users that have come before. However, without technology development, 
this proposal is probably impractical for the Gaia data. 

4 Discussion

How big do you have to make K? There is no simple answer; this depends on the
complexity of the posterior PDF and the precision with which users need to know 
their uncertainties. Our attitude is that you never — in any finite experiment — 
know your uncertainties to high precision, so we have the intuition that K does 
not have to be large to be an enormous improvement over a single catalogue with 
reported single-star uncertainties. One order-of-magnitude option for K in the 
case of Gaia is the number of visits, the number of times the spacecraft scans 
across a typical star. Jackknifing more finely than this will not help much on
individual-star uncertainties. 

But if anyone can make their own catalogue, the data will be changing; the 
science won't be repeatable! This point is good; published science needs to be 
repeatable by referees and subsequent investigators. On the other hand, our un- 
derstanding of the Gaia data will necessarily evolve as calibrations and physical 
models evolve. There is no hope for a completely stable catalogue. What is fixed 
is the set of raw pixel data. The approaches advocated here make the raw data 
the primary objects of release, and therefore tie the release to the only part of the 
data analysis that is stable and repeatable. Of course, it is necessary for reasons of 
repeatability to clearly tag, cut, and release well-documented and stable versions 
of the data, likelihood function, and primary catalogue. 

Is there any relevant difference between a Bayesian and a frequentist? Bayesian- 
ism is required for marginalization, because the prior provides the measure for 
integration. However, Bayesians and frequentists agree — or should agree — that 
the most important publication of an experiment is the likelihood function. It is 
through the likelihood that data impact our beliefs; different investigators have 
different priors, different objectives, and different external data at their disposal. 
They can only combine these with the Gaia data properly if the likelihood function 
is exposed. Even if the Gaia outputs end up being a sampling of some posterior 
PDF, those samples should come with evaluations of the prior, so that subsequent 
users can divide it out before combining with their own individual knowledge and 
data. 

NASA (grant NNX08AJ48G), the NSF (grant AST-0908357), and a Research Fellowship of the